feat: local review app, LoC+Zucker ingest, corpus audit (198 entries) #37
Merged
…er/source views

- Removed 164 entries flagged during corpus audit (209 remain)
- Marked 20 orphaned source records as rejected
- Review app: redesign home with clickable By Writer / By Source card grids
- Review app: new /writer/<slug> and /source/<id> per-group entry views
- Review app: show transcript status badge per entry (status + license if present)
- Review app: audit page now shows transcript info per entry
- Updated exports and README status (209 entries, 57 sources with entries)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- License breakdown bar: segmented colour bar + legend with counts and %
- Key metric blocks: entries, sources, writers, date range, transcript count
- Warn block shown if any entries have unclear rights
- compute_corpus_stats() helper in app.py; license short-names + colour map

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
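The counts and percentages behind the license breakdown bar can be sketched as below. This is a minimal illustration, not the actual `compute_corpus_stats()` from `app.py`: the function name `license_breakdown`, the `license` field name, and the `"unclear"` fallback label are all assumptions made for the example.

```python
from collections import Counter


def license_breakdown(entries):
    """Return [(license, count, percent)] sorted by count, largest first.

    Assumption: each entry is a dict with an optional "license" key;
    entries without one are grouped under the label "unclear".
    """
    counts = Counter(e.get("license") or "unclear" for e in entries)
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (name, n, round(100 * n / total, 1))
        for name, n in counts.most_common()
    ]


# Hypothetical sample data, for illustration only.
sample = [
    {"id": "a", "license": "CC-BY-SA 4.0"},
    {"id": "b", "license": "CC-BY-SA 4.0"},
    {"id": "c", "license": "PD"},
    {"id": "d"},  # no license recorded, counted as "unclear"
]

breakdown = license_breakdown(sample)
# A warn block would be shown when any entry has unclear rights.
has_unclear = any(name == "unclear" for name, _, _ in breakdown)
```

Each tuple maps onto one segment of the bar (width from the percentage) and one legend row (count and %).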
5 entries stored only a PDF file, which browsers can't display inline as an <img>. Used pdftoppm at 200 DPI to produce a JPEG thumbnail for each, added as role=thumbnail prepended in the files list so the review app picks it up immediately. Original PDF kept as role=original.

Affected entries:
- commons__auerbach_letter_shtenzel_1961__p0001
- commons__bendin_semichah_shtenzel_1933__p0001
- commons__weidenfeld_eruv_letter_1947__p0001
- commons__wosner_halachic_ruling_1981__p0001
- commons__wosner_support_letter_1990__p0001

Validation: 111 sources, 209 entries, 242 files verified.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
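A rough sketch of the thumbnail step described above, under stated assumptions. Only `pdftoppm -jpeg -r 200`, the `_thumb.jpg` suffix, and the `thumbnail`/`original` roles come from this PR; rendering just the first page via `-f 1 -l 1 -singlefile`, the helper names, and the `path`/`role` dict layout are guesses for illustration.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, out_stem, dpi=200):
    """Build the pdftoppm call that renders page 1 as <out_stem>.jpg.

    Assumption: only the first page is wanted; -singlefile makes pdftoppm
    write exactly one file without a page-number suffix.
    """
    return [
        "pdftoppm", "-jpeg", "-r", str(dpi),
        "-f", "1", "-l", "1", "-singlefile",
        str(pdf_path), str(out_stem),
    ]


def make_thumbnail(pdf_path, dpi=200):
    """Run pdftoppm (must be installed) and return the JPEG path."""
    pdf_path = Path(pdf_path)
    out_stem = pdf_path.with_name(pdf_path.stem + "_thumb")
    subprocess.run(pdftoppm_command(pdf_path, out_stem, dpi), check=True)
    return out_stem.parent / (out_stem.name + ".jpg")


def with_thumbnail(files, thumb_path):
    """Return a new files list with the thumbnail first, the PDF kept."""
    thumb = {"path": str(thumb_path), "role": "thumbnail"}
    # Drop any earlier thumbnail so re-running does not duplicate it.
    return [thumb] + [f for f in files if f.get("role") != "thumbnail"]


# Hypothetical file record, for illustration only.
files = [{"path": "scans/demo/letter.pdf", "role": "original"}]
updated = with_thumbnail(files, "scans/demo/letter_thumb.jpg")
```

Prepending rather than appending matters if the review app renders the first file in the list, which is what "picks it up immediately" suggests.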
Actions toggle (✎ Actions button in nav header):
- Hidden by default; one click reveals flag button + comment textarea on every entry card across all views (home, group pages, audit)
- State persists in localStorage; CSS-driven via body.show-actions class
- Audit submit button also gated behind the toggle

All Entries view (third tab on home page):
- Flat scrollable grid of all 209 entries, same card style as group pages
- Includes rights/transcript badges, lightbox zoom, action strips
- Browse save bar (sticky bottom) appears when Actions are on; saves to the same /api/audit/decide endpoint and merges with existing decisions

Group pages (writer/source):
- Flag/comment action strips added to each entry card (hidden by default)
- Floating browse save bar; loads + merges with existing audit decisions

New API endpoint GET /api/audit/decisions for client-side merge before save

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Removed entries (all flagged via audit UI 2026-05-25):
- commons__bodleian_geniza_ms_heb_d_41_4b__p0001 (Geniza fragment)
- commons__bodleian_geniza_ms_heb_e_39_78b__p0001 (Geniza fragment)
- commons__chief_rabbinate_letter_1921__p0001
- commons__chushiel_letter_geniza__p0001 (Geniza)
- commons__damascus_pentateuch_ms_heb_8_7088__p0001
- commons__geniza_education_ts_k5_13__p0001 (Geniza)
- commons__grodzinski_letter_about_kook__p0001
- commons__halper462_exilarch_genealogy__p0001
- loc__2024422570__p0003, p0004, p0005

9 now-orphaned source records marked rejected.
Corpus: 198 entries across 111 sources (228 files).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- scripts/ingest_loc.py: paginate LoC Hebraic Manuscripts JSON API, filter pre-1700/printed items, download up to 5 pages per item, write data/review/loc_pending.jsonl
- scripts/ingest_zucker.py: parse OPenn TEI manifests for the Zucker Ketubah Collection, write data/review/zucker_pending.jsonl
- scripts/merge_review.py: promote approved review decisions into entries.jsonl + sources.jsonl; auto-creates per-item source records
- scripts/review_app/requirements.txt: Flask dependency for review app
- scripts/review_app/templates/batch.html: batch review UI (invert accept pattern: all dim by default, click to accept)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
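The promotion step in `merge_review.py` might look roughly like the sketch below. The record layout (`id`, `source_id`, `status`), the decision values (`"approved"` / `"rejected"`), and the function name `promote` are assumptions for illustration; the actual script and its JSONL schema are not shown in this PR page.

```python
def promote(pending, decisions, entries, sources):
    """Append approved pending entries to `entries`, in place.

    A source record is created for any approved entry whose source is not
    yet known, so every entry always points at an existing source.
    Returns the list of promoted entry IDs.
    """
    known_entries = {e["id"] for e in entries}
    known_sources = {s["id"] for s in sources}
    promoted = []
    for item in pending:
        if decisions.get(item["id"]) != "approved":
            continue  # rejected or undecided: stays out of the corpus
        if item["id"] in known_entries:
            continue  # already promoted by an earlier run
        if item["source_id"] not in known_sources:
            # Hypothetical minimal source record.
            sources.append({"id": item["source_id"], "status": "accepted"})
            known_sources.add(item["source_id"])
        entries.append(item)
        known_entries.add(item["id"])
        promoted.append(item["id"])
    return promoted


# Hypothetical sample batch, for illustration only.
pending = [
    {"id": "demo__item1__p0001", "source_id": "demo__item1"},
    {"id": "demo__item1__p0002", "source_id": "demo__item1"},
    {"id": "demo__item2__p0001", "source_id": "demo__item2"},
]
decisions = {
    "demo__item1__p0001": "approved",
    "demo__item1__p0002": "rejected",
    "demo__item2__p0001": "approved",
}
entries, sources = [], []
first_run = promote(pending, decisions, entries, sources)
second_run = promote(pending, decisions, entries, sources)  # idempotent
```

Skipping IDs that are already in the index keeps the merge safe to re-run after a partial batch.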
- data/review/loc_pending.jsonl: 722 entries staged from LoC Hebraic Manuscripts collection; 166 items, up to 5 pages each
- data/review/loc_decisions.json: 722 decisions (27 approved, 695 rejected)
- data/review/zucker_pending.jsonl: 288 entries staged from OPenn Zucker Ketubah Collection
- data/review/zucker_decisions.json: 288 decisions (1 approved, 287 rejected)

These files serve as the audit trail for the two review sessions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds only the scan directories for sources whose entries are in the corpus index (entries.jsonl). Rejected/unreviewed scan directories from the same download sessions remain untracked.

Sources included:
- 16 LoC Hebraic Manuscripts items (loc__2018757642 … loc__2023530858) accepted from the 166-item LoC review session (27 entries total)
- openn__zucker__ket_z_238: single accepted Zucker ketubah (Hebrew-text panel, CC-BY-SA 4.0)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eview
- app.py full rewrite: extract _enrich_entries() helper (eliminates 4
copy-pasted enrichment blocks), add mtime-keyed module-level file
cache, add load_audit_decisions(live_ids) to filter stale decisions
at load time, fix path traversal in serve_scan() with resolve()+
relative_to() check, fix save_decisions() to merge-not-clobber via
existing.update(incoming), fix source_detail() 404 axis (check
source_id not in sources, not len(entries)==0), fix review_batch()
hardcoded source_id with primary_sid=max(set(...),key=count), fix
walrus-operator double-call in group thumb helpers, remove dead
imports (re, sys, datetime, timezone)
- templates: slim full-entry JSON blobs to ID-only arrays
(ENTRIES→ENTRY_IDS, ALL_ENTRIES→ALL_ENTRY_IDS) — eliminates ~588 KB
of tojson payload per page load; update save loops to iterate IDs
- group.html: remove dead fetch('/api/audit/status') try block in
saveDecisions() that preceded the real merge fetch
- data/review/audit_decisions.json: clear 175 stale decisions
(all referenced entries removed from corpus in audit passes)
- merge_review.py: prune stale IDs from audit_decisions.json after
each batch merge so the file stays in sync with the live index
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
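Two of the fixes listed in this commit, the `resolve()` + `relative_to()` path check and the merge-not-clobber save with stale-ID filtering, can be sketched as follows. This is an illustration of the techniques named above, not the code in `app.py`: the helper names `safe_scan_path` and `merge_decisions` and the dict-of-decisions shape are assumptions.

```python
from pathlib import Path


def safe_scan_path(scans_root, requested):
    """Resolve `requested` under `scans_root`, or return None if it escapes.

    resolve() collapses ".." segments and symlinks; relative_to() then
    raises ValueError when the result is not inside the root.
    """
    root = Path(scans_root).resolve()
    candidate = (root / requested).resolve()
    try:
        candidate.relative_to(root)
    except ValueError:
        return None  # caller would answer with a 404
    return candidate


def merge_decisions(existing, incoming, live_ids):
    """Merge new decisions over saved ones, then drop stale entry IDs.

    Equivalent to existing.update(incoming) on a copy, followed by a filter
    that keeps only IDs still present in the live corpus index.
    """
    merged = dict(existing)
    merged.update(incoming)
    return {k: v for k, v in merged.items() if k in live_ids}


# Hypothetical values, for illustration only.
inside = safe_scan_path("/srv/scans", "demo/p0001.jpg")
outside = safe_scan_path("/srv/scans", "../../etc/passwd")
merged = merge_decisions(
    existing={"a": "keep", "b": "keep", "gone": "flag"},
    incoming={"b": "flag"},
    live_ids={"a", "b"},
)
```

Joining an absolute `requested` path replaces the root entirely in `pathlib`, so the same `relative_to()` check also rejects that case.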
Summary
This PR captures a full session of data ingestion, corpus curation, and tooling work. It brings the corpus to 198 verified entries (net, after two audit passes), ingests two new sources (LoC Hebraic Manuscripts and the OPenn Zucker Ketubah Collection), and adds a complete local review application.
Data
New sources ingested
Corpus audit passes
Two post-ingest audit passes over all existing entries removed out-of-scope material:
Net corpus after all removals: 198 entries, 57 active sources, 228 files, ~382 MiB.
Scan files
Adds scan directories for the 17 accepted sources only (16 LoC + 1 Zucker, ~25 MiB). Rejected/unreviewed scan directories from the same download sessions remain untracked.
PDF → JPEG thumbnails
Five corpus entries that had only a PDF file (no renderable image) were fixed by running `pdftoppm -jpeg -r 200` to produce a `_thumb.jpg` per entry and adding it as `role: thumbnail` in the files list.
Tooling
Ingest scripts
- `scripts/ingest_loc.py`: paginates the LoC JSON API, filters pre-1700/printed items, downloads up to 5 pages per item, writes `data/review/loc_pending.jsonl`
- `scripts/ingest_zucker.py`: parses OPenn TEI manifests for the Zucker collection, writes `data/review/zucker_pending.jsonl`
- `scripts/merge_review.py`: promotes approved decisions into `entries.jsonl` + `sources.jsonl`; auto-creates per-item source records so the entry-ID → source-ID constraint is always satisfied
Local review app (`scripts/review_app/`)
A Flask app (port 5757) for human review of pending batches and the verified corpus. Run with `pip install flask && python scripts/review_app/app.py`.
Home page (two/three-way view toggle):
Corpus stats dashboard (top of home page):
Per-writer / per-source detail pages:
Corpus audit page (`/audit`):
`data/review/audit_decisions.json`
Global ✎ Actions toggle (nav header, all pages):
`localStorage`; same `/api/audit/decide` endpoint used everywhere
Batch review UI (`/review/<batch_id>`):
Documentation
- `AGENTS.md`: tightened corpus scope (18th century minimum, cursive כתב יד only, Yiddish in Hebrew script in scope, Judeo-Arabic out of scope)
- `docs/sources/wikimedia_queue.md`: updated Wikimedia queue log
- `README.md`, exports, `NOTICE.md`, `CITATION.cff`, `datapackage.json`: regenerated from current index
🤖 Generated with Claude Code